from IPython.display import HTML
HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import os
import plotly.express as px
import plotly.graph_objects as go
from glob import glob
This report explores the utilization of big data to advance biological research by harnessing the power of the Global Biodiversity Information Facility (GBIF) dataset. The GBIF dataset provides open access to biodiversity data collected from governments and organizations worldwide. However, the dataset suffers from missing values in different fields, making analysis challenging. To address this issue, we propose a solution that utilizes Amazon's cloud computing capabilities and supervised machine learning to predict null values. The idea is that if the predictions are accurate enough, then they can be used as proxies to the null values.
The study achieved high accuracy in predicting the establishment field, with taxonomic order and country code identified as the top predictors. To address multicollinearity concerns among taxonomic ranks, models were created separately for each rank, yielding satisfactory metrics. Taxonomic kingdom produced accuracy, precision, and recall scores above 0.95. Predicting latitude and longitude proved more challenging, with mean absolute errors (MAEs) of 3 and 6 degrees, respectively, and higher standard errors (SEs). We compared the predictive power of taxonomic ranks to the overall top predictor (country) and found that while country had lower MAEs for latitude and longitude, kingdom showed lower standard errors. Kingdom also exhibited lower standard errors in predicting depth and elevation; because the MAEs for these fields were closer to the baseline than those for latitude and longitude, the study suggests kingdom has potential for depth and elevation prediction. To enhance the study, it is important to address biases in the data, particularly the geographic bias towards participating countries and the taxonomic bias towards birds. Additionally, we recommend applying the developed pipeline to other fields in the GBIF dataset, expanding its utility and improving data completeness.
The field of biodiversity research has long been a cornerstone of scientific exploration, as researchers strive to unravel the intricacies of our planet's ecosystems and understand the multitude of species that inhabit them. The rich tapestry of life on Earth, from the smallest microorganisms to the largest mammals, has fascinated scientists and fueled their curiosity for centuries. Over the course of history, numerous institutions, organizations, and passionate individuals have dedicated their efforts to studying and documenting biodiversity, resulting in a wealth of valuable information.
However, despite the collective efforts of the scientific community, this knowledge remains fragmented and dispersed across various sources. The data on biodiversity is scattered in research publications, government reports, museum collections, field notes, and countless other repositories. Accessing and integrating this vast amount of biodiversity data poses a significant challenge, limiting the potential for comprehensive analysis and hindering scientific advancements.
In response to this challenge, the Global Biodiversity Information Facility (GBIF) was established as an international initiative [1]. Founded in 2001, GBIF aims to provide open access to biodiversity data collected from governments, research institutions, and organizations worldwide. It serves as a global network, facilitating the sharing and integration of biodiversity data to enable researchers, conservationists, and policymakers to access and utilize comprehensive information.
GBIF has created a centralized platform that aggregates and standardizes biodiversity data, making it accessible and interoperable. The organization works with data providers to ensure that their datasets adhere to the Darwin Core data standard, which serves as the core framework for structuring and organizing biodiversity data. By following the Darwin Core standard, data providers can ensure that their datasets are compatible with the GBIF network and can be easily shared and integrated with other biodiversity datasets [2].
The importance of GBIF and its efforts to provide open access to biodiversity data cannot be overstated. By bringing together data from diverse sources and making it freely available, GBIF enables researchers to address critical questions about species distributions, population dynamics, ecological interactions, and the impacts of environmental changes. It facilitates collaborations across borders, disciplines, and sectors, fostering a global community dedicated to advancing biodiversity research and conservation.
Since its establishment, GBIF has grown into a global collaborative effort involving more than 120 countries (Figure 1). The GBIF network comprises not only governments and research institutions but also numerous non-governmental organizations (NGOs) and citizen science initiatives. This broad participation demonstrates the commitment of the international community to sharing and utilizing biodiversity data for the benefit of science, conservation, and sustainable development.

One significant challenge in leveraging the GBIF dataset is the presence of missing values. These gaps can arise from the lack of standardized data collection protocols across records that span centuries and come from many different sources. Additionally, concerns about illegal activities, such as poaching, may lead researchers to withhold precise occurrence locations, further contributing to data incompleteness.
The motivation behind this report is rooted in the desire to unlock the full potential of the GBIF dataset for advancing biodiversity research. By addressing the challenge of missing values, we aim to enhance the dataset's completeness and accessibility, enabling researchers to derive deeper insights into the intricate workings of ecosystems and species dynamics. The availability of comprehensive biodiversity data is crucial for understanding the impacts of environmental changes, tracking the spread of invasive species, monitoring disease outbreaks, and implementing effective conservation strategies. By improving the completeness of the GBIF dataset, we can empower scientists, environmentalists, and decision-makers to make informed decisions and take proactive measures to safeguard our planet's ecosystems.
Various scientific articles in recent years emphasize the significance of biodiversity. Rand et al. underscore the pressing requirement for effective biodiversity conservation, advocating the integration of conservation into policies, securing sufficient financing, and driving transformative changes at institutional and societal levels [3]. Cardinale et al. express concerns about the ramifications of biodiversity loss on ecosystem functioning and the capacity to sustain essential benefits for human well-being [4]. Dasgupta emphasizes the necessity to transition from GDP as a progress indicator to a national Wealth measure that encompasses Natural Capital, stressing the importance of incorporating biodiversity in economic assessments [5]. Thus, there is a growing need for a unified database to advance biodiversity initiatives.
The findings and recommendations of this report seek to provide a pathway for leveraging big data and machine learning in biodiversity research, illustrating the potential of these technologies to unlock valuable insights from vast and complex datasets. By bridging the gap between scattered biodiversity information and actionable knowledge, we aim to catalyze scientific advancements and contribute to the preservation and sustainable management of Earth's rich biodiversity. This leads to the problem statement below.
"How can we improve the GBIF dataset's fields for the benefit of future biodiversity researchers?"
The Global Biodiversity Information Facility (GBIF) is available through the publicly available Registry of Open Data of Amazon Web Services (AWS) [6]. The dataset consists of over 2 billion occurrence records, encompassing a wide range of species and geographic locations [7].
The dataset includes fields related to taxonomy, place of occurrence, and time of occurrence, among other details. Figure 2 shows the schema. The taxonomy information provides classification and categorization of species, including their scientific names, common names, and hierarchical ranks such as kingdom, phylum, class, order, family, genus, and species. The location information includes latitude and longitude coordinates, as well as depth and elevation, which pinpoint the occurrence locations of species. The dataset also includes additional details such as establishment (i.e. whether the observed species was native or non-native to the place of occurrence), and other attributes pertaining to how the occurrences were recorded.
The Natural Earth dataset was used in conjunction with the GBIF dataset to enhance visualization capabilities. The comprehensive Natural Earth dataset encompasses a wide range of geospatial information, including political boundaries, landforms, physical features, cultural and populated places, coastlines, rivers, lakes, and more. This dataset is specifically designed to be easily accessible and highly adaptable for various mapping and GIS (Geographic Information System) applications [8].


To overcome the challenge of missing values, we propose a solution that leverages the capabilities of Amazon's cloud computing and employs supervised machine learning techniques. Through predictive modeling, we aim to fill the gaps in the dataset by accurately predicting missing values based on available information. The high-level methodology is shown in Figure 3.
Data collection. The data collection process for the GBIF dataset involved accessing the dataset from the AWS Registry of Open Data using the PySpark library in an AWS Elastic MapReduce (EMR) environment. The configured EMR cluster consists of 1 primary node and 3 core nodes, with no task nodes; each node has 4 cores and 16 GiB of memory. This distributed setup enabled efficient, parallel processing of the dataset, making it well-suited for handling the large volume of data.
Data exploration. The dataset was thoroughly examined to gain insights into its various fields. Data profiling techniques were employed to assess the dataset's structure and quality, including analyzing missing values, data types, and inconsistencies, enabling subsequent data cleaning and preprocessing for enhanced data integrity and reliability. Bar graphs, pie charts, and other visual representations were used to depict the distribution and composition of the data across different categories or variables. Through these exploratory techniques, key insights and observations were obtained, revealing important trends, relationships, and potential areas of interest within the GBIF dataset. The findings from data exploration laid the foundation for subsequent data cleaning, modeling, and evaluation steps.
Data cleaning. The dataset was prepared by creating subsets based on specific criteria relevant to the project's focus. The project specifically concentrated on the following fields: kingdom, phylum, class, order, country code, decimallatitude, decimallongitude, depth, elevation, day, month, year, and establishment means. Excluding lower taxonomic ranks with a large number of classes helped prevent overfitting in predictive models. It should also be noted that since these lower taxonomic ranks are still fluid, creating models based on them might render the models inaccurate for future use. Other fields were not considered as significant, but their inclusion was noted in the Recommendations section for future work. Among the chosen important fields, the team chose five fields to predict: establishment means, decimallongitude, decimallatitude, depth, and elevation as these were the most underfilled fields and would provide the most benefit when imputed. The null values of each of these five fields were removed in preparation for model training. For the field of establishment means, oversampling was employed to handle class imbalance.
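The oversampling step can be illustrated on a small pandas sample. This is a sketch of the logic only, assuming a random-duplication approach; the actual pipeline ran in PySpark, and the helper name `oversample_minority` is ours, not from the notebooks:

```python
import pandas as pd

def oversample_minority(df, label_col, seed=42):
    """Randomly duplicate minority-class rows until every class
    matches the majority-class count."""
    counts = df[label_col].value_counts()
    majority = counts.max()
    parts = []
    for cls, n in counts.items():
        subset = df[df[label_col] == cls]
        if n < majority:
            # Sample with replacement to reach the majority count
            extra = subset.sample(majority - n, replace=True, random_state=seed)
            subset = pd.concat([subset, extra])
        parts.append(subset)
    return pd.concat(parts, ignore_index=True)

# Toy example mirroring the native/non-native imbalance
toy = pd.DataFrame({'establishmentmeans': ['native'] * 8 + ['non-native'] * 2})
balanced = oversample_minority(toy, 'establishmentmeans')
# balanced now holds 8 rows of each class
```

As in the actual pipeline, only the training split would be balanced this way; the test split is left untouched so that evaluation reflects the true class distribution.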
Model training. Each subset of the GBIF dataset was trained using the random forest algorithm from MLlib, PySpark's machine learning library, which supports distributed training and evaluation. Due to time constraints, only the default maxDepth of 5, which determines the maximum depth of the decision trees in the random forest, was used for the models. To address multicollinearity concerns, subsets of specific fields were used as features, resulting in separate models based on country code, individual taxonomic ranks, or a combination of taxonomic ranks with decimallatitude and decimallongitude. This approach allowed for the creation of multiple models per target variable, enabling thorough comparison and evaluation.
Model evaluation. The best models were selected based on established metrics for classification and regression tasks. For classification, accuracy, precision, and recall were employed as evaluation metrics. Mean absolute error and standard error were used as metrics for regression. To assess the predictive power of each feature for each target variable, two approaches were taken. First, the team examined the feature importance derived from the model that included all features. Second, the performance of models trained on specific features was analyzed to determine their individual contributions.
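As a concrete reference for the classification metrics, here is a small numpy sketch for the binary case, treating "native" as the positive class. This illustrates the definitions only; the actual evaluation used MLlib's distributed evaluators, whose multiclass precision and recall are class-weighted:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Toy example: a mostly-positive test set, as with native establishments
acc, prec, rec = classification_metrics([1, 1, 1, 1, 0], [1, 1, 1, 0, 0])
```

With very few negatives in the test set, recall over the dominant class stays high almost by construction, which is why Table 1's recall values are uniformly strong.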
The data behind the following descriptive plots was collected, explored, and cleaned in an AWS EMR PySpark cluster environment due to the size of the data. This processing is documented in bdcc-project-descriptive.ipynb. Summarized details important to each plot were saved into JSON files, which can be found in the df directory.
def read_json_files(folder_path):
    """
    Reads and combines the line-delimited JSON files in a folder.
    """
    # Collect the DataFrame parsed from each file
    data_list = []
    # Iterate over each file in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.json'):
            # Construct the full file path
            file_path = os.path.join(folder_path, filename)
            # Read the JSON file directly from its path
            data = pd.read_json(file_path, orient='records', lines=True)
            data_list.append(data)
    # Concatenate the data from all files into a single DataFrame
    combined_data = pd.concat(data_list, ignore_index=True)
    return combined_data
# Read the files
establishment = read_json_files('df/df_establishment')
coord = read_json_files('df/df_coords')
year = read_json_files('df/df_year')
df = pd.read_json('df/taxon.json', lines=True)
fig = px.sunburst(df.dropna(), path=['kingdom', 'phylum', 'class'],
values='count', color_discrete_sequence=['orange', 'green'])
fig.update_layout(
title='Taxonomy Distribution',
height=800,
plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
The taxonomic bias in GBIF observations is primarily influenced by the availability of data to the observers. This bias is evident when examining the distribution of observations across taxa, as seen in Figure 4. The graph clearly indicates that a significant proportion of the observations are concentrated in the Animalia kingdom, specifically within the Aves, which comprises birds. This pattern is a direct result of the substantial contribution of bird datasets to the overall data pool.
The underlying spatial bias is due to the accessibility and interest of observers. Birds, compared to other creatures, attract a considerable amount of attention, hence more observations and records.
fig = px.bar(establishment.dropna().sort_values(by='count'),
orientation='h',
y='establishmentmeans',
x='count',
color_discrete_sequence=['#00af6b'])
fig.update_layout(
title='Establishment',
height=800,
plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
The establishment bias in biodiversity data collection and recording is characterized by a preference for native species, as seen in Figure 5. This bias may arise due to various factors such as research focus, funding priorities, observer bias, identification challenges, and data availability and accessibility. Native species receive more attention and resources in biodiversity studies and conservation efforts, leading to an overrepresentation of their occurrence records in databases like GBIF. Observers may also exhibit a bias towards recording native species due to their familiarity and interest.
# Create the plot using Plotly Express
fig = px.bar(year, x='year', y='count',
             color_discrete_sequence=['#00af6b', 'black'])
fig.update_layout(
    title='Yearly Occurrences',
    height=700,
    plot_bgcolor='rgba(0, 0, 0, 0)',
    font=dict(size=15)
)
# Display the plot
fig.show()
The dataset also suffers from temporal bias, where the distribution of data is uneven across time periods, as seen in Figure 6. The primary reason for this is the limited availability of digitized historic data. Most of the occurrence data in GBIF was documented within the past 20 years.
Note that the date of an occurrence is not always the date on which the record was added to the GBIF dataset. The occurrence date reflects the actual sighting or mention of a living creature, which may be discovered belatedly in media such as texts or images.
coord = coord.round(1).drop_duplicates()
fig = go.Figure(go.Scattergeo(lat=coord['decimallatitude'],
lon=coord['decimallongitude']))
fig.update_traces(marker_size=1, line=dict(color='Black'))
fig.update_geos(landcolor="#00af6b")
fig.update_layout(height=500,
title='Location of Occurrences Map')
fig.write_html("3d_plot.html")
fig.show()
Based on Figure 7, there is a geographic bias in occurrences around GBIF's participant countries, most prominently North America, Western Europe, Australia, and portions of South America and of Eastern and Southern Africa. There are, however, exceptions to this, such as non-participating countries like Japan and the Philippines contributing portions of the data.
Location of Establishment with Occurrences

As a point of interest, Figure 8 shows a sample of occurrences alongside the type of establishment at each point. Native establishments, colored in black, are overlaid on top of non-native establishments, colored in red. There are no noticeable boundaries that can be drawn between exclusively black or red areas; they all seem to cohabit the same areas. We therefore cannot say that latitude and longitude alone will reasonably predict establishment. The code for this plot was run in an AWS EMR cluster and can be found in bdcc-project-descriptive.ipynb.
The code for predicting establishment means is located in the .ipynb file bdcc-project-estmeans.ipynb.
The fields selected to train the model were kingdom, phylum, class, order, country code, decimallatitude, decimallongitude, day, month, and year. The fields depth and elevation were dropped because they would significantly reduce the number of observations in the dataset. Dropping the irrelevant fields and null values reduced the dataset to 13,453,488 observations. As mentioned in the Methodology and Data Exploration sections, the minority classes (vagrant, introduced, reintroduced, uncertain) were combined into a single class (non-native), and the class imbalance of the training set was treated using oversampling. The class imbalance of the test set was left untreated.
Table 1 shows the test accuracy, precision, and recall of five different models that utilize different sets of features: all features, kingdom only, phylum only, class only, and order only. The recall was high for all models since there are only a few negatives (non-native) in the test set.
| Features | Accuracy | Precision | Recall |
|---|---|---|---|
| All features | 0.830 | 0.989 | 0.830 |
| Kingdom only | 0.950 | 0.986 | 0.961 |
| Phylum only | 0.860 | 0.952 | 0.899 |
| Class only | 0.840 | 0.991 | 0.839 |
| Order only | 0.731 | 0.989 | 0.725 |
estmeans_feat_imp = [('phylum_idx', 0.017797044406777037),
('kingdom_idx', 0.08924157122357632),
('class_idx', 0.03859461371355322),
('order_idx', 0.5672940964397347),
('countrycode_idx', 0.24607812980027127),
('decimallatitude', 0.007771979465193949),
('decimallongitude', 0.01416604506225753),
('year', 0.018388693797088268),
('month', 0.000662138950693343),
('day', 5.687140854502852e-06)]
df_estmeans_feat_imp = pd.DataFrame(estmeans_feat_imp,
columns=['feature', 'score'])
px.bar(df_estmeans_feat_imp.set_index('feature').sort_values(by='score'),
x='score',
orientation='h',
template='plotly_white',
title='Establishment Means Feature Importances',
color_discrete_sequence=['orange', 'green'])
Figure 9 shows the feature importance for the model that utilized all the features. Noticeably, the top predictors were taxonomic order and country code. A possible problem with this feature importance is that, since each taxonomic rank is a subset of the rank above it (except kingdom, the highest rank in the dataset), the importance of the lowest rank included as a feature, order, might have been inflated due to multicollinearity. This was the motivation for creating models using the individual taxonomic ranks. Surprisingly, kingdom alone yielded exceptional values for accuracy, precision, and recall, indicating its great potential as a predictor of establishment means. The scores decrease the lower the taxonomic rank, with order being the lowest, confirming the suspicion of multicollinearity.
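The nesting that drives this multicollinearity is easy to demonstrate: each value of a lower rank maps to exactly one value of every higher rank, so the lower rank carries all of the higher ranks' information. A toy pandas check (the taxa below are illustrative examples, not drawn from the GBIF subset):

```python
import pandas as pd

taxa = pd.DataFrame({
    'kingdom': ['Animalia', 'Animalia', 'Plantae', 'Plantae'],
    'order':   ['Passeriformes', 'Carnivora', 'Poales', 'Asparagales'],
})
# Count how many distinct kingdoms each order maps to
orders_per_kingdom = taxa.groupby('order')['kingdom'].nunique()
# Every order belongs to exactly one kingdom, so order can stand in
# for kingdom in the model, inflating order's measured importance
all_nested = (orders_per_kingdom == 1).all()
```

Because the split-based importance is shared among correlated features, whichever nested rank the trees split on first absorbs the credit, which is consistent with order dominating Figure 9.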
The code for predicting location is located in the .ipynb file bdcc-project-loc.ipynb.
The fields selected to train the models were kingdom, phylum, class, order, country code, day, month, year, and establishmentmeans. decimallatitude was also added as a feature for predicting decimallongitude, and vice versa. The fields depth and elevation were dropped because they would significantly reduce the number of observations in the dataset. Dropping the irrelevant fields and null values reduced the dataset to 1,988,949,726 observations.
The evaluation scores for the model that utilized all features to predict latitude are presented as the mean absolute error (MAE) and standard error (SE), measuring 2.98 degrees and 20.0 degrees, respectively. Although these values may appear small, it is crucial to consider that a 1-degree difference in latitude corresponds to an approximate surface distance of 111 kilometers. Consequently, the observed error has the potential to span an entire country, highlighting the significance of accuracy in latitude prediction.
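To make the scale of these errors concrete, the latitude MAE can be converted to surface distance using the ~111 km-per-degree figure cited above:

```python
# One degree of latitude spans roughly 111 km on the Earth's surface
KM_PER_DEGREE_LAT = 111

mae_lat_degrees = 2.98  # MAE of the all-features latitude model
mae_lat_km = mae_lat_degrees * KM_PER_DEGREE_LAT
# An average error on the order of hundreds of kilometers,
# enough to cross many countries entirely
```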
Examining Figure 10, the feature importance analysis reveals that the top predictors for latitude are country code and order. The correlation between country and latitude is expected, given their geographical relationship. However, it is worth noting that the high importance assigned to order might be inflated due to multicollinearity issues. To address this, four additional models were trained, each using an individual taxonomic rank, while a baseline model focusing solely on country was also developed.
The evaluation scores for these models are presented in Table 2. In contrast to establishment means, the model's performance, as measured by MAE, improves as we move down the taxonomic ranks. However, it is important to consider the possibility of overfitting in these lower ranks. While the MAE for country is smaller than that of the taxonomic ranks, indicating relatively accurate predictions on average, it is worth noting that countries spanning multiple latitudes may experience more significant discrepancies in the model's predictions. On the other hand, kingdom, despite having a higher average prediction error, exhibits a more consistent pattern in terms of SE.
lat_feat_imp = [('class_idx', 0.016451716915352802),
('kingdom_idx', 1.9837409058161797e-05),
('phylum_idx', 0.0009620434045473223),
('order_idx', 0.2118371960094643),
('countrycode_idx', 0.6094754757771987),
('decimallongitude', 0.09032980387574896),
('year', 0.05335175832749377),
('month', 0.0034334978631863043),
('day', 0.012728044829791932),
('establishmentmeans_idx', 0.0014106255881578742)]
df_lat_feat_imp = pd.DataFrame(lat_feat_imp,
columns=['feature', 'score'])
px.bar(df_lat_feat_imp.set_index('feature').sort_values(by='score'),
x='score',
orientation='h',
template='plotly_white',
title='Latitude Feature Importances',
color_discrete_sequence=['orange', 'green'])
| Features | MAE | SE |
|---|---|---|
| All features | 2.98 | 20.0 |
| Country only (baseline) | 2.91 | 20.6 |
| Kingdom only | 13.0 | 1.00 |
| Phylum only | 13.1 | 3.11 |
| Class only | 11.3 | 9.76 |
| Order only | 10.2 | 12.6 |
Using all features, the MAE and SE for predicting longitude were 5.72 and 51.5, respectively. This is significantly higher than for latitude, which could be explained by the fact that the range of values for longitude (-180 to +180 degrees) is twice that of latitude (-90 to +90 degrees). Figure 11 shows the feature importance. Similar to latitude, the top predictors for longitude were country code and order.
The evaluation scores are shown in Table 3. While the trend of high MAEs from latitude persists to longitude, the errors seem to be inflated. The ratio of kingdom MAE to baseline MAE is 4.47 for latitude and 7.75 for longitude, indicating that the errors in longitude prediction appear to be more pronounced. These findings suggest that climate differences play a crucial role in shaping the biodiversity patterns observed along latitude. The constraints imposed by distinct climates in regions separated by latitude, such as the equatorial and polar regions, result in a smaller biodiversity range per latitude. In contrast, while longitude can be utilized to broadly differentiate between land and oceanic regions, the predictability of taxonomic ranks decreases. The coarse distinction provided by longitude might not be sufficiently granular to accurately capture the intricate biodiversity patterns that emerge within terrestrial and marine ecosystems. Therefore, other factors or more precise geographical predictors may be required to enhance the prediction accuracy of longitude. These findings highlight the complex interplay between climate differences and biodiversity distribution across latitudinal and longitudinal gradients.
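The MAE ratios quoted above follow directly from the kingdom-only and country-only rows of Tables 2 and 3; a quick arithmetic check:

```python
# Kingdom-only MAE divided by country-only (baseline) MAE,
# values taken from Tables 2 and 3
lat_ratio = 13.0 / 2.91   # latitude
lon_ratio = 43.7 / 5.64   # longitude
# Relative to baseline, longitude errors inflate roughly 7.7x
# versus ~4.5x for latitude
```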
lon_feat_imp = [('class_idx', 0.013755334204592623),
('kingdom_idx', 0.0003252116123264463),
('phylum_idx', 0.0014842530799240232),
('order_idx', 0.18169184100254143),
('countrycode_idx', 0.6926561102192004),
('decimallatitude', 0.09598184853624232),
('year', 0.012299570288871789),
('month', 0.0008011038663212653),
('day', 0.0002742069859476148),
('establishmentmeans_idx', 0.0007305202040319144)]
df_lon_feat_imp = pd.DataFrame(lon_feat_imp,
columns=['feature', 'score'])
px.bar(df_lon_feat_imp.set_index('feature').sort_values(by='score'),
x='score',
orientation='h',
template='plotly_white',
title='Longitude Feature Importances',
color_discrete_sequence=['orange', 'green'])
| Features | MAE | SE |
|---|---|---|
| All features | 5.72 | 51.5 |
| Country only (baseline) | 5.64 | 53.3 |
| Kingdom only | 43.7 | 10.6 |
| Phylum only | 42.7 | 12.9 |
| Class only | 34.8 | 23.0 |
| Order only | 32.1 | 28.4 |
The code for predicting depth is located in the .ipynb file bdcc-project-depth.ipynb.
The fields selected to train the model were kingdom, phylum, class, order, country code, decimallatitude, decimallongitude, day, month, and year. Dropping the irrelevant fields and the null values of latitude and longitude reduced the dataset to 25,291,164 observations.
Figure 12 presents the feature importance of the model that utilized all features to predict depth. The top predictors were order, country, and longitude. This confirms the previous remarks about the significance of longitudes in separating land from oceans. Table 4 provides the MAEs and SEs for the different models, measured in meters, following the logic of latitude and longitude predictions. Interestingly, unlike longitude and latitude, the MAEs for the various models are now comparable, with taxonomic class even outperforming country. Surprisingly, kingdom still exhibits a significantly smaller standard error, making it the most reliable predictor among the other models. It is important to note that the order of magnitude of the MAEs is acceptable only for organisms that inhabit varying depths, such as pelagic fishes and plankton.
depth_feat_imp = [('class_idx', 0.0630556039772949),
('kingdom_idx', 0.004210919976788162),
('phylum_idx', 0.011154232175912318),
('order_idx', 0.4614366096808723),
('countrycode_idx', 0.2378575889927695),
('decimallatitude', 0.05144196137211182),
('decimallongitude', 0.1394492178853374),
('year', 0.013843160933182923),
('month', 0.007576045976722387),
('day', 0.009974659029008294)]
df_depth_feat_imp = pd.DataFrame(depth_feat_imp,
columns=['feature', 'score'])
px.bar(df_depth_feat_imp.set_index('feature').sort_values(by='score'),
x='score',
orientation='h',
template='plotly_white',
title='Depth Feature Importances',
color_discrete_sequence=['orange', 'green'])
| Features | MAE | SE |
|---|---|---|
| All features | 115 | 202 |
| Country only (baseline) | 125 | 134 |
| Kingdom only | 141 | 56.5 |
| Phylum only | 128 | 113 |
| Class only | 123 | 133 |
| Order only | 127 | 168 |
The code for predicting elevation is located in the .ipynb file bdcc-project-depth.ipynb.
The fields selected to train the model were kingdom, phylum, class, order, country code, decimallatitude, decimallongitude, day, month, and year. Dropping the irrelevant fields and the null values of latitude and longitude reduced the dataset to 101,071,513 observations.
Figure 13 displays the feature importance of the model that utilized all features to predict elevation. The top predictors were country, order, and latitude. In this case, latitude likely plays a crucial role in distinguishing mountainous regions from lowlands, making it a good predictor for elevation. Table 5 showcases the MAEs and SEs of the different models, measured in meters, following the logic of the latitude and longitude predictions. Similar to depth, the MAEs for the different models are comparable, while the standard error for kingdom remains significantly lower than the others. However, considering that the highest mountain in the Philippines is around 3,000 meters and that landforms lower than 1,000 meters can still be considered mountains, the order of magnitude of the errors renders the model unusable for accurately predicting elevation. On a positive note, the model can still be employed to distinguish organisms inhabiting mountainous regions from those in lowlands.
elev_feat_imp = [('class_idx', 0.00681022867351552),
('kingdom_idx', 0.0006023368172190745),
('phylum_idx', 0.003601382852358121),
('order_idx', 0.14586831724145194),
('countrycode_idx', 0.4172593752938513),
('decimallatitude', 0.24892064535477706),
('decimallongitude', 0.11397808965577835),
('year', 0.018381988421197638),
('month', 0.04457763568985092),
('day', 0.0)]
df_elev_feat_imp = pd.DataFrame(elev_feat_imp,
columns=['feature', 'score'])
px.bar(df_elev_feat_imp.set_index('feature').sort_values(by='score'),
x='score',
orientation='h',
template='plotly_white',
title='Elevation Feature Importances',
color_discrete_sequence=['orange', 'green'])
| Features | MAE | SE |
|---|---|---|
| All features | 437 | 305 |
| Country only (baseline) | 507 | 134 |
| Kingdom only | 554 | 72 |
| Phylum only | 549 | 100 |
| Class only | 544 | 128 |
| Order only | 540 | 183 |
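The MAE and SE columns above can be computed from held-out predictions as in the sketch below. The prediction arrays are hypothetical, and treating SE as the standard error of the absolute errors is an assumption about how the report's SE column was derived:

```python
import numpy as np

def mae_and_se(y_true, y_pred):
    """Mean absolute error and its standard error
    (std of the absolute errors divided by sqrt(n))."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    mae = abs_err.mean()
    se = abs_err.std(ddof=1) / np.sqrt(len(abs_err))
    return mae, se

# Hypothetical elevations in meters, for illustration only
y_true = np.array([120.0, 850.0, 40.0, 1500.0])
y_pred = np.array([300.0, 600.0, 90.0, 1100.0])
mae, se = mae_and_se(y_true, y_pred)
```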
In conclusion, this report highlights the potential of leveraging big data and machine learning techniques to improve the Global Biodiversity Information Facility dataset and advance biodiversity research. The study successfully addressed the challenge of missing values in the dataset by proposing a solution that utilized Amazon's cloud computing capabilities and supervised machine learning. The study achieved high accuracy in predicting the establishment field, with taxonomic order and country code identified as the top predictors. While there were concerns about multicollinearity with taxonomic ranks, creating separate models for each rank yielded satisfactory metrics, with kingdom producing scores above 0.95. These findings indicate the potential of taxonomic ranks, particularly kingdom, as valuable predictors of establishment means.
Predicting latitude and longitude proved to be more challenging, with mean absolute errors (MAEs) of 3 and 6 degrees, respectively, and higher standard errors. The top predictors for latitude and longitude were country code and taxonomic order, with country code showing lower MAEs but higher standard errors compared to the taxonomic ranks. On the other hand, kingdom exhibited high MAEs but lower SEs in predicting depth and elevation, suggesting its potential as a predictor for these variables. While the MAEs for latitude and longitude may seem small, they can result in significant location discrepancies, spanning entire countries, which should be considered in practical applications.
The evaluation scores reveal a persistent trend of high MAEs for both latitude and longitude, with the errors in longitude prediction being more pronounced. This suggests that climate differences play a crucial role in shaping biodiversity patterns along latitude: the distinct climates between equatorial and polar regions impose constraints, resulting in a narrower range of biodiversity per latitude band. The predictability from taxonomic ranks decreases for longitude, however, as the coarse distinction between land and ocean fails to capture the intricate biodiversity patterns within terrestrial and marine ecosystems.
The feature importance analysis for the model predicting depth reveals that the top predictors are order, country, and longitude. This supports the earlier observation on the importance of longitude in distinguishing between land and oceans. Unlike latitude and longitude, the MAEs of the depth models are comparable, with taxonomic class even surpassing country as a predictor. Interestingly, kingdom demonstrates a significantly lower standard error, making it the most reliable of the models.
For predicting elevation, country, order, and latitude emerged as the top predictors. Latitude is likely a crucial factor in distinguishing between mountainous regions and lowlands, making it a reliable predictor for elevation. Similar to the depth predictions, the MAEs of the models for elevation are comparable, while the standard error for kingdom remains significantly lower than the others. The model's accuracy diminishes due to the magnitude of the errors, limiting its applicability in accurately predicting precise elevations. Nonetheless, the model can still provide valuable insights for organisms occupying different elevation zones, such as mountainous and lowland regions.
Overall, our study highlights the potential of using machine learning to unlock the value of large-scale biodiversity datasets, paving the way for a better understanding of our planet's diverse ecosystems.
The following recommendations address the challenges that the team faced in the course of the study:
Address biases in the data: The study identified geographic bias towards participating countries and taxonomic bias towards birds. It is important to address these biases to ensure the dataset is representative of global biodiversity. Efforts should be made to expand data collection in underrepresented regions and taxonomic groups, fostering a more comprehensive and balanced dataset.
Apply the developed pipeline to other fields in the GBIF dataset: The study focused on predicting establishment means, latitude, longitude, depth, and elevation. However, there are other fields in the GBIF dataset that could benefit from similar predictive modeling approaches. Applying the developed pipeline to these fields, such as species richness or habitat preferences, would improve data completeness and enhance the utility of the dataset for a wider range of research questions.
Explore additional features: While the study considered several important features, there may be other variables in and outside of the GBIF dataset that could improve the accuracy of the predictive models. Exploring and incorporating additional relevant features, such as climate data, habitat characteristics, or ecological interactions, could lead to better predictions and provide a more comprehensive understanding of biodiversity patterns.
Address limitations of the models: The study used default parameters for the random forest algorithm due to time constraints. Also, the scale of the dataset forced the team to use PySpark and its limited set of machine learning capabilities. Further optimization and fine-tuning of the models could potentially improve their performance. Exploring alternative machine learning algorithms, such as gradient boosting or deep learning, could also be beneficial to enhance the predictions.
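As a sketch of the tuning recommended above, the snippet below runs a small grid search over random forest hyperparameters. It assumes scikit-learn is available outside the PySpark pipeline, and both the synthetic data and the parameter grid are illustrative rather than the study's actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for (features, elevation) pairs
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(200, 3))
y = X[:, 0] * 50 + rng.normal(0, 10, size=200)

# Search over a small, illustrative grid instead of default parameters
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [50, 100], 'max_depth': [5, None]},
    scoring='neg_mean_absolute_error',
    cv=3,
)
grid.fit(X, y)
best_model = grid.best_estimator_
```

PySpark's `CrossValidator` with a `ParamGridBuilder` offers the equivalent workflow at the scale of the full dataset.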
Enhance data cleaning and preprocessing: The study performed data cleaning and preprocessing to improve data integrity and reliability, but this can be refined further. The dataset includes an issues field that tags possibly erroneous input, and an external coordinate cleaner can flag erroneous coordinates [9]. Utilizing this information and implementing more advanced techniques such as outlier detection would further enhance the quality of the dataset and minimize potential biases or errors in the analysis.
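A minimal example of the kind of coordinate screening mentioned above is a hand-rolled range check like the one below. This is not the CoordinateCleaner package itself (an R library [9]), only a sketch of two of the simplest checks it performs:

```python
import pandas as pd

def flag_suspect_coordinates(df: pd.DataFrame) -> pd.Series:
    """Flag rows whose coordinates are out of valid range or sit exactly
    at (0, 0), a common placeholder for missing GPS data."""
    lat, lon = df['decimallatitude'], df['decimallongitude']
    out_of_range = (lat.abs() > 90) | (lon.abs() > 180)
    null_island = (lat == 0) & (lon == 0)
    return out_of_range | null_island
```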
Collaborate with domain experts: Involving biodiversity researchers and experts in the field can provide valuable insights and domain-specific knowledge to improve the modeling approaches and interpret the results effectively. As a real test of the models' predictive power, the models can be used to actually predict on the observations with null values, with experts serving as the arbiter of ground truth. Collaboration with experts in taxonomy, ecology, and conservation biology can help refine the selection and interpretation of features, as well as validate and contextualize the model predictions.
Promote open data collaboration: Encourage governments, research institutions, and organizations to contribute their biodiversity data to the GBIF dataset, fostering a collaborative environment for sharing and utilizing comprehensive biodiversity information. Emphasizing the importance of open data and promoting data sharing practices would enable researchers worldwide to access and leverage diverse datasets, leading to more robust and comprehensive analyses of global biodiversity patterns.
[1] GBIF. (n.d.). Retrieved from https://www.gbif.org/
[2] Darwin Core Quick Reference Guide. (2021). Retrieved from https://dwc.tdwg.org/terms/
[3] Rands, M.R.W., Adams, W.M., Bennun, L., Butchart, S.H.M., Clements, A., et al. (2010). Biodiversity Conservation: Challenges Beyond 2010. Science, 329(5997), 1298-1303. https://doi.org/10.1126/science.1189138
[4] Cardinale, B. J., Duffy, E., Gonzalez, A., Hooper, D. U., Perrings, C., et al. (2012). Biodiversity loss and its impact on humanity. Nature, 486(7401), 59-67. https://doi.org/10.1038/nature11148
[5] Dasgupta, P. (2021). The economics of biodiversity: The Dasgupta Review. London: HM Treasury. https://doi.org/10.2458/jpe.2289
[6] GBIF Public Datasets on Amazon Web Services. (n.d.). Retrieved from https://github.com/gbif/occurrence/blob/master/aws-public-data.md
[7] GBIF.org (01 May 2023) GBIF Occurrence Data https://doi.org/10.15468/dl.m64bks
[8] Natural Earth. (n.d.). Retrieved from https://www.naturalearthdata.com/
[9] Zizka, A. et al. CoordinateCleaner: standardized cleaning of occurrence records from biological collection databases. Methods Ecol. Evol. 10, 744–751 (2019). https://doi.org/10.1111/2041-210X.13152